S3: An Efficient Shared Scan Scheduler on MapReduce Framework
Authors
Abstract
Hadoop, an open-source implementation of MapReduce, has been widely used for data-intensive computing. To improve performance, multiple jobs operating on a common data file can be processed as a batch to share the cost of scanning the file. However, in practice, jobs often do not arrive at the same time, and batching them means a longer waiting time for jobs that arrive earlier. In this paper, we propose S3 – a novel Shared Scan Scheduler for Hadoop – which allows the scan of a common file to be shared among multiple jobs that may arrive at different times. Under S3, a job is split into a sequence of (independent) sub-jobs, each operating on a different portion of the data file; moreover, multiple sub-jobs (from different jobs) that access a common portion of a data file can be processed as a batch to share the scan of the accessed data. S3 operates as follows: at any time, the system may be processing a batch of sub-jobs (that access the same portion of data); meanwhile, other sub-jobs wait in a job queue; as a new job arrives, its sub-jobs are aligned with the waiting sub-jobs in the queue; once the current batch of sub-jobs completes processing, the next batch of sub-jobs (which may include sub-jobs from newly arrived jobs) is initiated for processing. In this way, an arriving job does not need to wait long to be processed. We have implemented our S3 approach in Hadoop, and our experimental results on a cluster of over 40 nodes show that S3 outperforms the naïve no-sharing scheme and the file-based shared-scan approach.
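The batching scheme described in the abstract can be illustrated with a small model: each job is split into one sub-job per file chunk, and sub-jobs from different jobs that target the same chunk are grouped so that chunk is scanned only once. This is a minimal sketch, not Hadoop code; the class and method names (`SharedScanScheduler`, `submit`, `next_batch`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

class SharedScanScheduler:
    """Illustrative model of an S3-style shared-scan scheduler.

    The data file is divided into `num_chunks` portions. Each submitted
    job contributes one sub-job per chunk; sub-jobs from different jobs
    that access the same chunk form one batch sharing a single scan.
    """

    def __init__(self, num_chunks):
        self.num_chunks = num_chunks
        # chunk index -> job ids whose sub-job still needs this chunk
        self.pending = defaultdict(list)
        # chunks awaiting a scan, in the order they were first requested
        self.order = deque()

    def submit(self, job_id):
        # Align the new job's sub-jobs with waiting sub-jobs: a chunk
        # already queued simply gains one more member in its batch; a
        # chunk not currently queued (e.g. already scanned for earlier
        # jobs) is re-queued for the newcomer.
        for chunk in range(self.num_chunks):
            if not self.pending[chunk]:
                self.order.append(chunk)
            self.pending[chunk].append(job_id)

    def next_batch(self):
        # Return (chunk, job ids) to process with one shared scan,
        # or None when no sub-jobs are waiting.
        while self.order:
            chunk = self.order.popleft()
            batch = self.pending.pop(chunk, [])
            if batch:
                return chunk, batch
        return None
```

For example, if job A is submitted, chunk 0 is scanned for A alone, and job B then arrives, the scans of chunks 1 and 2 each serve both A and B in one batch, and chunk 0 is re-queued for B, so B starts making progress without waiting for A to finish.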
Similar papers
Shared Cluster Scheduling: a Fair and Efficient Protocol
In this work we focus on the problem of resource allocation in a shared cluster used for data-intensive scalable computing. Specifically, we target Hadoop, the open-source implementation of the MapReduce framework, and design a new scheduling algorithm that caters to both fair and efficient utilization of a shared cluster. Our scheduler, labelled FSP, achieves both goals by "focusing" the res...
A Throughput Driven Task Scheduler for Batch Jobs in Shared MapReduce Environments
MapReduce is one of the most popular parallel data processing systems, and it has been widely used in many fields. As one of the most important techniques in MapReduce, the task scheduling strategy is directly related to system performance. However, in multi-user shared MapReduce environments, the existing task scheduling algorithms cannot provide high system throughput when processing batch jo...
Survey on Task Assignment Techniques in Hadoop
MapReduce is an implementation for processing large-scale data in parallel. The actual benefits of MapReduce emerge when the framework is deployed on a large-scale, shared-nothing cluster. The MapReduce framework abstracts away the complexity of running distributed data processing across multiple nodes in a cluster. Hadoop is an open-source implementation of the MapReduce framework, which processes the vast amount o...
Hadoop Map Reduce Job Scheduler Implementation and Analysis in Heterogeneous Environment
Hadoop MapReduce is one of the popular frameworks for BigData analytics. A MapReduce cluster is shared among multiple users with heterogeneous workloads. When jobs are submitted to the cluster concurrently, resources are shared among them, so system performance may degrade. The issue is how to schedule the tasks and provide fairness of resources to all jobs. Hadoop supports different s...
Evaluating the Job Performance using DyScale Scheduler and MapReduce in Hadoop framework
As there has been massive growth in social media in all aspects over a span of ten years, the number of photos being uploaded to the Internet has increased. Photos are shared, downloaded, and uploaded in vast quantities through many online services such as WhatsApp, Facebook, and Instagram, to name a few. But the applications making use of these uploaded photos are very few. He...